Probabilistic Model for Structured Document Mapping: Application to Automatic HTML to XML Conversion
Authors
Abstract
We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that provides a link between input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML to XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.
Similar Resources
From Layout to Semantic: a Reranking Model for Mapping Web Documents to Mediated XML Representations
Many documents on the Web are formatted in a weakly structured format. Because of their weak semantics and the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format which will be more amenable to automatic processing of the document con...
Probabilistic models for the dynamics of tree-structured data
Many information sources on the web are generated by scripts and are highly structured, e.g., a movie website like IMDB. While these documents share a common HTML tree structure (or XML schema), the structure is not static and changes over time. These temporal changes tend to break many information extraction tools, such as wrappers, which depend on precise knowledge of the structures. In this pape...
An XML-based Generic Architecture for the Construction of Interactive Web Components
A generic architecture for constructing interactive web-based components or subsystems is described. Examples of component types are custom content navigation and presentation components, interactive exercise and testing components, and an electronic catalogue and ordering system component. The instances of a component type are collections of structured data which share a common syntactic and s...
Printing Structured Text without Stylesheets?
As more and more XML documents start to appear, e.g. on the WWW, users face a new problem: unlike HTML tags, XML tags do not tell the semantics of a structure element. This means that if a document does not come with layout specifications, e.g. XSL or CSS, it is not easy to say how the document should be formatted for presentation in print or on screen. In this paper we describe a too...
Biblet: A portable BibTEX bibliography style for generating highly customizable XHTML
We present Biblet, a set of BibTEX bibliography styles (bst) which generate XHTML from BibTEX databases. Unlike other BibTEX to XML/HTML converters, Biblet is written entirely in the native BibTEX style language and therefore works “out of the box” on any system that runs BibTEX. Features include automatic conversion of LaTeX symbols to HTML or Unicode entities; customizable graphical hyperlinks...
Journal title:
Volume, issue:
Pages: -
Publication date: 2007